Abstract — A new class of distances appropriate for measuring
similarity relations between sequences, say one type
of similarity per distance, is studied. We propose a new 
"normalized information distance", based on the noncom- 
putable notion of Kolmogorov complexity, and show that it 
is in this class and it minorizes every computable distance in 
the class (that is, it is universal in that it discovers all com- 
putable similarities). We demonstrate that it is a metric 
and call it the similarity metric. This theory forms the foun- 
dation for a new practical tool. To evidence generality and 
robustness we give two distinctive applications in widely di- 
vergent areas using standard compression programs like gzip 
and GenCompress. First, we compare whole mitochondrial 
genomes and infer their evolutionary history. This results in 
a first completely automatically computed whole mitochondrial
phylogeny tree. Secondly, we fully automatically compute 
the language tree of 52 different languages. 

Index Terms — dissimilarity distance, Kolmogorov com- 
plexity, language tree construction, normalized information 
distance, normalized compression distance, phylogeny in 
bioinformatics, parameter-free data-mining, universal sim- 
ilarity metric 



I. Introduction 

How do we measure similarity — for example to determine 
an evolutionary distance — between two sequences, such as 
internet documents, different language text corpora in the 
same language, among different languages based on ex- 
ample text corpora, computer programs, or chain letters? 
How do we detect plagiarism of student source code in as- 
signments? Finally, the fast advance of worldwide genome 
sequencing projects has raised the following fundamental 
question to prominence in contemporary biological science: 
how do we compare two genomes [30], [ST]? 

Our aim here is not to define a similarity measure for 
a certain application field based on background knowledge 
and feature parameters specific to that field; instead we 
develop a general mathematical theory of similarity that 
uses no background knowledge or features specific to an 
application area. Hence it is, without changes, applicable 
to different areas and even to collections of objects taken
from different areas. The method automatically zooms in 
on the dominant similarity aspect between every two ob- 
jects. To realize this goal, we first define a wide class of 
similarity distances. Then, we show that this class con- 
tains a particular distance that is universal in the following 
sense: for every pair of objects the particular distance is less 
than any "effective" distance in the class between those two 
objects. This universal distance is called the "normalized
information distance" (NID). It is shown to be a metric,
and, intuitively, it uncovers simultaneously all the similarities
that the effective distances in the class uncover one
at a time. (Here, "effective" is used as shorthand for a
certain notion of "computability" that will acquire its pre- 
cise meaning below.) We develop a practical analogue of 
the NID based on real-world compressors, called the "nor- 
malized compression distance" (NCD), and test it on real- 
world applications in a wide range of fields: we present the 
first completely automatic construction of the phylogeny 
tree based on whole mitochondrial genomes, and a com- 
pletely automatic construction of a language tree for over 
50 Euro-Asian languages.

Previous Work: Preliminary applications of the cur- 
rent approach were tentatively reported to the biological 
community and elsewhere ^JJ, [3TJ, [33]. That work, and 
the present paper, is based on information distance |ii?>| . 
[3], a universal metric that minorizes in an appropriate 
sense every effective metric: effective versions of Ham- 
ming distance, Euclidean distance, edit distances, Lempel- 
Ziv distance, and the sophisticated distances introduced in 
[Tfij, [HHj. Subsequent work in the linguistics setting, 0,
[3J, used related ad hoc compression-based methods, Appendix^].
The information distance studied in [32], [6'6\, [3], [HJJ,
and subsequently investigated in [25], jSH], [SI, [49],
is defined as the length of the shortest binary program
that is needed to transform the two objects into each
other. This distance can be interpreted also as being pro- 
portional to the minimal amount of energy required to do 
the transformation: A species may lose genes (by deletion) 
or gain genes (by duplication or insertion from external 
sources), relatively easily. Deletion and insertion cost energy
(proportional to the Kolmogorov complexity of deleting
or inserting sequences in the information distance), an
aspect that was stressed in [32]. But this distance is not
proper to measure evolutionary sequence distance. For ex- 
ample, H. influenza and E. coli are two closely related sister 
species. The former has about 1,856,000 base pairs and the 
latter has about 4,772,000 base pairs. However, using the 
information distance of [3], one would easily classify H. in- 
fluenza with a short (of comparable length) but irrelevant 
species simply because of length, instead of with E. coli. 
The problem is that the information distance of [3] deals
with absolute distance rather than with relative distance.
The paper [48] defined a transformation distance between
two species, and |2U defined a compression distance. Both
of these measures are essentially related to K(x | y). Other
than being asymmetric, they also suffer from being abso- 
lute rather than relative. As far as the authors know, the 
idea of relative or normalized distance is, surprisingly, not 
well studied. An exception is [52], which investigates the normalized
Euclidean metric and the normalized symmetric-set-difference
metric to account for relative distances rather
than absolute ones, and it does so for much the same reasons
as does the present work. In [42] the equivalent functional
of (V.1) in information theory, expressed in terms of
the corresponding probabilistic notions, is shown to be a
metric. (Our Lemma V.4 implies this result, but obviously
not the other way around.)

This Work: We develop a general mathematical the- 
ory of similarity based on a notion of normalized distances. 
Suppose we define a new distance by setting the value between
every pair of objects to the minimal upper semi-computable
(Definition II.3 below) normalized distance
(possibly a different distance for every pair). This new 
distance is a non-uniform lower bound on the upper semi- 
computable normalized distances. The central notion of 
this work is the "normalized information distance," given 
by a simple formula, that is a metric, belongs to the class of 
normalized distances, and minorizes the non-uniform lower 
bound above. It is (possibly) not upper semi-computable, 
but it is the first universal similarity measure, and is an objective
recursively invariant notion by the Church-Turing
thesis. We cannot compute the normalized information
distance, which is expressed in terms of the noncomputable
Kolmogorov complexities of the objects concerned.
Instead, we look at whether a real-world imperfect analogue
works experimentally, by replacing the Kolmogorov complexities
by the lengths of the compressed objects using
real-world compressors like gzip or GenCompress. Here
we show the results of experiments in the diverse areas of 
(i) bio-molecular evolution studies, and (ii) natural lan- 
guage evolution. In area (i): In recent years, as the com- 
plete genomes of various species become available, it has 
become possible to do whole genome phylogeny (this over- 
comes the problem that different genes may give different 
trees [§], 03). However, traditional phylogenetic methods
on individual genes depended on multiple alignment of the 
related proteins and on the model of evolution of individ- 
ual amino acids. Neither of these is practically applica- 
ble to the genome level. In this situation, a method that 
can compute shared information between two individual 
sequences is useful because biological sequences encode in- 
formation, and the occurrence of evolutionary events (such 
as insertions, deletions, point mutations, rearrangements, 
and inversions) separating two sequences sharing a common 
ancestor will result in partial loss of their shared informa- 
tion. Our theoretical approach is used experimentally to 
create a fully automated and reasonably accurate software 
tool based on such a distance to compare two genomes. We 
demonstrate that a whole mitochondrial genome phylogeny
of the Eutherians can be reconstructed automatically from
unaligned complete mitochondrial genomes by use of our
software implementing (an approximation of) our theory,
confirming one of the hypotheses in [5]. These experimental
confirmations of the efficacy of our comprehensive approach
contrast with recent more specialized approaches
such as [5U] that have been (and perhaps can be) tested only on
small numbers of genes. They have not been experimen- 
tally tried on whole mitochondrial genomes that are, appar- 
ently, already numerically out of computational range. In 
area (ii) we fully automatically construct the language tree 
of 52 primarily Indo-European languages from translations 
of the "Universal Declaration of Human Rights" — leading 
to a grouping of language families largely consistent with 
current linguistic viewpoints. Other experiments and applications
performed earlier, and not reported here, are: detecting
plagiarism in student programming assignments, and the
phylogeny of chain letters in |S].

Subsequent Work: The current paper can be viewed
as the theoretical basis of a trilogy of papers. In [15]
we address the gap between the rigorously proven optimality
of the normalized information distance based on the
noncomputable notion of Kolmogorov complexity, and the 
experimental successes of the "normalized compression dis- 
tance" or "NCD" which is the same formula with the Kol- 
mogorov complexity replaced by the lengths in bits of the 
compressed files using a standard compressor. We provide 
an axiomatization of a notion of "normal compressor," and 
argue that all standard compressors, be it of the Lempel- 
Ziv type (gzip), block sorting type (bzip2), or statistical 
type (PPMZ), are normal. It is shown that the NCD based 
on a normal compressor is a similarity distance, satisfies 
the metric properties, and it approximates universality. To 
extract a hierarchy of clusters from the distance matrix, 
we designed a new quartet method and a fast heuristic to 
implement it. The method is implemented and available 
on the web as a free open-source software tool: the CompLearn
Toolkit [13]. To substantiate claims of universality
and robustness, [15] reports successful applications in areas
as diverse as genomics, virology, languages, literature,
music, handwritten digits, astronomy, and combinations of 
objects from completely different domains, using statisti- 
cal, dictionary, and block sorting compressors. We tested 
the method both on natural data sets from a single domain 
and combinations of different domains (music, genomes, 
texts, executables, Java programs), and on artificial ones 
where we know the right answer. In [14] we applied the
method in detail to music clustering (independently, 35^
applied the method of [2] in this area). The method has
been reported abundantly and extensively in the popular
science press, for example [37], jS], ^7j, and has attracted
considerable attention and follow-up applications by
researchers in specialized areas. One example of this is in
parameter-free data mining and time series analysis [27].
In that paper the efficacy of the compression method is
evidenced by a host of experiments. It is also shown that
the compression-based method is superior to any other
method for comparison of heterogeneous files (for example



IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO Y, MONTH 2004 






time series), and anomaly detection, see Appendix IbI 

II. Preliminaries 

Distance and Metric: Without loss of generality, a
distance only needs to operate on finite sequences of 0's
and 1's since every finite sequence over a finite alphabet
can be represented by a finite binary sequence. Formally,
a distance is a function D with nonnegative real values,
defined on the Cartesian product X × X of a set X. It is
called a metric on X if for every x, y, z ∈ X:

• D(x, y) = 0 iff x = y (the identity axiom);

• D(x, y) + D(y, z) ≥ D(x, z) (the triangle inequality);

• D(x, y) = D(y, x) (the symmetry axiom).

A set X provided with a metric is called a metric space.
For example, every set X has the trivial discrete metric
D(x, y) = 0 if x = y and D(x, y) = 1 otherwise.
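
The three axioms above can be checked mechanically on any finite set of objects. The following is a minimal sketch, not part of the original paper: the names `is_metric`, `discrete`, and `hamming` are hypothetical, and the check is exhaustive, so it is only practical for small sets.

```python
from itertools import product


def is_metric(points, dist):
    """Exhaustively check the three metric axioms on a finite set of points."""
    for x, y in product(points, repeat=2):
        if (dist(x, y) == 0) != (x == y):   # identity axiom
            return False
        if dist(x, y) != dist(y, x):        # symmetry axiom
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, y) + dist(y, z) < dist(x, z):  # triangle inequality
            return False
    return True


def discrete(x, y):
    """Trivial discrete metric: 0 if the objects are equal, 1 otherwise."""
    return 0 if x == y else 1


def hamming(x, y):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


strings = ["000", "011", "101", "111"]
print(is_metric(strings, discrete))                       # discrete metric
print(is_metric(strings, hamming))                        # Hamming distance
print(is_metric([0, 1, 2], lambda x, y: (x - y) ** 2))    # violates the triangle inequality
```

The squared difference fails because d(0, 1) + d(1, 2) = 2 is smaller than d(0, 2) = 4.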

Kolmogorov Complexity: A treatment of the theory
of Kolmogorov complexity can be found in the text [22].
Here we recall some basic notation and facts. We write
string to mean a finite binary string. Other finite objects
can be encoded into strings in natural ways. The set of
strings is denoted by {0, 1}*. The Kolmogorov complexity
of a file is essentially the length of the ultimate compressed
version of the file. Formally, the Kolmogorov complexity,
or algorithmic entropy, K(x) of a string x is the length of a
shortest binary program x* to compute x on an appropriate
universal computer, such as a universal Turing machine.
Thus, K(x) = |x*|, the length of x* [22], denotes the number
of bits of information from which x can be computationally
retrieved. If there is more than one shortest program,
then x* is the first one in standard enumeration.

Remark II.1: We require that x can be decompressed
from its compressed version x* by a general decompressor
program, but we do not require that x can be
compressed to x* by a general compressor program. In
fact, it is easy to prove that there does not exist such a
compressor program, since K(x) is a noncomputable function.
Thus, K(x) serves as the ultimate lower bound of
what a real-world compressor can possibly achieve. ◇

Remark II.2: To be precise, without going into details,
the Kolmogorov complexity we use is the "prefix" version,
where the programs of the universal computer are prefix-free
(no program is a proper prefix of another program). It
is equivalent to consider the length of the shortest binary
program to compute x in a universal programming language
such as LISP or Java. Note that these programs are
always prefix-free, since there is an end-of-program marker.



The conditional Kolmogorov complexity K(x | y) of x relative
to y is defined similarly as the length of a shortest
program to compute x if y is furnished as an auxiliary input
to the computation. We use the notation K(x, y) for
the length of a shortest binary program that prints out x
and y and a description how to tell them apart. The functions
K(·) and K(· | ·), though defined in terms of a particular
machine model, are machine-independent up to an
additive constant and acquire an asymptotically universal
and absolute character through Church's thesis, from the
ability of universal machines to simulate one another and
execute any effective process.

Definition II.3: A real-valued function f(x, y) is upper
semi-computable if there exists a rational-valued recursive
function g(x, y, t) such that (i) g(x, y, t+1) ≤ g(x, y, t), and
(ii) lim_{t→∞} g(x, y, t) = f(x, y). It is lower semi-computable
if −f(x, y) is upper semi-computable, and it is computable
if it is both upper and lower semi-computable.
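
Condition (i) of the definition, approximation from above by a non-increasing sequence, can be illustrated with real compressors. The sketch below is only an analogy under stated assumptions: the function name `upper_approximations` is hypothetical, the three standard-library compressors stand in for "programs found so far", and the sequence stops after three steps, so it does not converge to K(x) as condition (ii) would require.

```python
import bz2
import lzma
import zlib


def upper_approximations(data: bytes):
    """Yield a non-increasing sequence of upper bounds (in bits) on the
    description length of `data`, one new bound per compressor tried.

    Each value is the best (shortest) description found so far, which
    mirrors condition (i): g(x, t+1) <= g(x, t).
    """
    best = 8 * len(data)              # trivial bound: store the data literally
    yield best
    for compress in (zlib.compress, bz2.compress, lzma.compress):
        best = min(best, 8 * len(compress(data)))
        yield best


for t, bound in enumerate(upper_approximations(b"ab" * 500)):
    print(t, bound)
```

Every value is an upper bound only up to the additive constant needed to name the decompressor, in line with the additive-constant conventions used throughout this section.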

It is easy to see that the functions K(x) and K(y | x*)
(and under the appropriate interpretation also x*, given
x) are upper semi-computable, and it is easy to prove that
they are not computable. The conditional information contained
in x* is equivalent to that in (x, K(x)): there are
fixed recursive functions f, g such that for every x we have
f(x*) = (x, K(x)) and g(x, K(x)) = x*. The information
about x contained in y is defined as I(y : x) = K(x) − K(x |
y*). A deep, and very useful, result [20] shows that there
is a constant c_1 ≥ 0, independent of x, y, such that

$$K(x, y) = K(x) + K(y \mid x^*) = K(y) + K(x \mid y^*), \tag{II.1}$$

with the equalities holding up to c_1 additive precision.
Hence, up to an additive constant term I(x : y) = I(y : x).
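
The last step can be spelled out as a short derivation. Every equality below holds up to an additive constant, and I(x : y) = K(y) − K(y | x*) is the definition of I(y : x) with the roles of x and y exchanged:

$$\begin{aligned}
K(x) + K(y \mid x^*) &= K(y) + K(x \mid y^*) \\
\Longrightarrow\quad K(y) - K(y \mid x^*) &= K(x) - K(x \mid y^*) \\
\Longrightarrow\quad I(x : y) &= I(y : x).
\end{aligned}$$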

Precision: It is customary in this area to use "additive 
constant c" or equivalently "additive O(1) term" to mean
a constant, accounting for the length of a fixed binary pro- 
gram, independent from every variable or parameter in the 
expression in which it occurs. 

III. Information Distance 

In our search for the proper definition of the distance
between two, not necessarily equal length, binary strings,
a natural choice is the length of the shortest program that
can transform either string into the other one, both ways
[3]. This is one of the main concepts in this work. Formally,
the information distance is the length E(x, y) of a
shortest binary program that computes x from y as well
as computing y from x. Being shortest, such a program
should take advantage of any redundancy between the information
required to go from x to y and the information
required to go from y to x. The program functions in a
catalytic capacity in the sense that it is required to transform
the input into the output, but itself remains present
and unchanged throughout the computation. A principal
result of [3] shows that the information distance equals

$$E(x, y) = \max\{K(y \mid x), K(x \mid y)\} \tag{III.1}$$

up to an additive O(log max{K(y | x), K(x | y)})
term. The information distance E(x, y) is upper semi-computable:
By dovetailing the running of all programs
we can find shorter and shorter candidate prefix-free programs
p with p(x) = y and p(y) = x, and in the limit
obtain such a p with |p| = E(x, y). (It is very important
here that the time of computation is completely ignored:
this is why this result does not contradict the existence of
one-way functions.) It was shown in [3], Theorem 4.2, that
the information distance E(x, y) is a metric. More precisely,
it satisfies the metric properties up to an additive
fixed finite constant. A property of E(x, y) that is central
for our purposes here is that it minorizes every "admissible
distance" (below) up to an additive constant. In defining
the class of admissible distances we want to exclude unrealistic
distances like f(x, y) = 1/2 for every pair x ≠ y, by
restricting the number of objects within a given distance of
an object. Moreover, we want distances to be computable
in some manner.



Definition III.1: Let Ω = {0, 1}*. A function D : Ω ×
Ω → R^+ (where R^+ denotes the positive real numbers) is
an admissible distance if it is upper semi-computable, symmetric,
and for every pair of objects x, y ∈ Ω the distance
D(x, y) is the length of a binary prefix code-word that is
a program that computes x from y, and vice versa, in the
reference programming language.

Remark III.2: In [3] we considered "admissible metric",
but the triangle inequality metric restriction is not necessary
for our purposes here. ◇

If D is an admissible distance, then for every x ∈ {0, 1}*
the set {D(x, y) : y ∈ {0, 1}*} is the length set of a prefix
code. Hence it satisfies the Kraft inequality

$$\sum_{y} 2^{-D(x,y)} \le 1, \tag{III.2}$$



which gives us the desired density condition. 
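
As a small illustration of why prefix-freeness forces this density condition, the sketch below computes the Kraft sum of a set of code-word lengths exactly. The helper names `kraft_sum` and `is_prefix_free` and the example code are assumptions introduced here, not taken from the paper.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Exact value of the sum of 2^(-l) over the given code-word lengths."""
    return sum(Fraction(1, 2 ** n) for n in lengths)


def is_prefix_free(codewords):
    """True if no code word is a prefix of another (duplicates count as prefixes).

    After lexicographic sorting, any prefix relation shows up between
    neighbours, so checking adjacent pairs is sufficient.
    """
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


code = ["0", "10", "110", "111"]
print(is_prefix_free(code), kraft_sum(len(w) for w in code))

# Lengths 1, 1, 2 cannot belong to any binary prefix code: the sum exceeds 1.
print(kraft_sum([1, 1, 2]))
```

Because the sum is bounded by 1, only few objects y can have a small value D(x, y): at most 2^l of them can lie within distance l of x.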

Example III.3: In representing the Hamming distance d
between strings x and y of equal length n differing in positions
i_1, ..., i_d, we can use a simple prefix-free encoding of
(n, d, i_1, ..., i_d) in H_n(x, y) = 2 log n + 4 log log n + 2 + d log n
bits. We encode n and d prefix-free in log n + 2 log log n + 1
bits each, and then the literal indexes of the
actual flipped-bit positions. Hence, H_n(x, y) is the length
of a prefix code word (prefix program) to compute x from
y and vice versa. Then, by the Kraft inequality,

$$\sum_{y} 2^{-H_n(x,y)} \le 1. \tag{III.3}$$

It is easy to verify that H_n is a metric in the sense that
it satisfies the metric (in)equalities up to O(log n) additive
precision.
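
The code-length formula of this example is easy to evaluate. The sketch below assumes base-2 logarithms and ignores rounding to whole bits; the function name `hamming_code_length` is hypothetical.

```python
from math import log2


def hamming_code_length(x: str, y: str) -> float:
    """Length in bits of the prefix-free encoding of (n, d, i_1, ..., i_d):
    H_n(x, y) = 2 log n + 4 log log n + 2 + d log n.

    Requires equal-length strings with n >= 2 so that log log n is defined.
    """
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    n = len(x)
    if n < 2:
        raise ValueError("n must be at least 2")
    d = sum(a != b for a, b in zip(x, y))   # Hamming distance
    return 2 * log2(n) + 4 * log2(log2(n)) + 2 + d * log2(n)


x = "0" * 16
y = "111" + "0" * 13          # differs from x in d = 3 positions
print(hamming_code_length(x, y))   # 2*4 + 4*2 + 2 + 3*4 = 30.0
```

For long strings the term d log n dominates, which is why the normalization in the next section divides by a multiple of n log n.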
Theorem III.4: The information distance E(x, y) is an
admissible distance that satisfies the metric inequalities up
to an additive constant, and it is minimal in the sense that
for every admissible distance D(x, y) we have

$$E(x, y) \le D(x, y) + O(1).$$

Remark III.5: This is the same statement as Theorem
4.2 in [3], except that there the D(x, y)'s were also required
to be metrics. But the proof given doesn't use that restriction
and therefore suffices for the slightly more general
theorem as stated here. ◇

Suppose we want to quantify how much objects differ 
in terms of a given feature, for example the length in bits 
of files, the number of beats per second in music pieces, 
the number of occurrences of a given base in the genomes. 
Every specific feature induces a distance, and every specific
distance measure can be viewed as a quantification of
an associated feature difference. The above theorem states
that among all features that correspond to upper semi-computable
distances that satisfy the density condition
(III.2), the information distance is universal in that among
all such distances it is always smallest up to constant precision.
That is, it accounts for the dominant feature in which
two objects are alike.

IV. Normalized Distance 

Many distances are absolute, but if we want to express 
similarity, then we are more interested in relative ones. For 
example, if two strings of length 10^6 differ by 1000 bits,
then we are inclined to think that those strings are relatively
more similar than two strings of 1000 bits that have
that distance.

Definition IV.1: A normalized distance or similarity distance
is a function d : Ω × Ω → [0, 1] that is symmetric,
d(x, y) = d(y, x), and for every x ∈ {0, 1}* and every constant
e ∈ [0, 1]

$$|\{y : d(x, y) \le e \le 1\}| < 2^{eK(x)+1}. \tag{IV.1}$$

The density requirement (IV.1) is implied by a "normalized"
version of the Kraft inequality:

Lemma IV.2: Let d : Ω × Ω → [0, 1] satisfy

$$\sum_{y} 2^{-d(x,y)K(x)} \le 1. \tag{IV.2}$$

Then, d satisfies (IV.1).

Proof: For suppose the contrary: there is an e ∈ [0, 1]
such that (IV.1) is false. Then, starting from (IV.2) we
obtain a contradiction:

$$1 \ge \sum_{y} 2^{-d(x,y)K(x)} \ge \sum_{y : d(x,y) \le e \le 1} 2^{-eK(x)} \ge 2^{eK(x)+1} \, 2^{-eK(x)} > 1.$$



Remark IV.3: If d(x, y) is a normalized version of an
admissible distance D(x, y) with D(x, y)/d(x, y) ≥ K(x),
then (IV.2) implies (III.2).

We call a normalized distance a "similarity" distance, 
because it gives a relative similarity (with distance 0 when
objects are maximally similar and distance 1 when they are
maximally dissimilar) and, conversely, for a well-defined 
notion of absolute distance (based on some feature) we can 
express similarity according to that feature as a similarity 
distance being a normalized version of the original absolute 
distance. In the literature a distance that expresses lack of 
similarity (like ours) is often called a "dissimilarity" dis- 
tance or a "disparity" distance. 

Example IV.4: The prefix-code for the Hamming distance
H_n(x, y) between x, y ∈ {0, 1}^n in Example III.3
is a program to compute from x to y and vice versa.
To turn it into a similarity distance define h_n(x, y) =
H_n(x, y)/(α(x, y) n log n) with α(x, y) satisfying the inequality
nH(eα(x, y)) ≤ eK(x) for every 0 ≤ e ≤ 1 and
0 ≤ h_n(x, y) ≤ 1 for every n, x, y, where this time H denotes
the entropy with two possibilities with probabilities
p = eα(x, y) and 1 − p, respectively. For example, for x
with K(x) = n and y within n/2 bit flips of x, we can
set α(x, y) = 1/2, yielding h_n(x, y) = 2d/n with d the number
of bit flips to obtain y from x. For every x, the number
of y in the Hamming ball h_n(x, y) ≤ e is upper bounded
by 2^{nH(eα(x,y))}. By the constraint on α(x, y), the function
h_n(x, y) satisfies the density condition (IV.1). ◇
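
The entropy bound on the size of a Hamming ball that this example relies on can be checked numerically. The sketch below verifies the standard bound that the number of strings within r ≤ n/2 bit flips of a fixed string of length n is at most 2^{nH(r/n)}. The function names are hypothetical, and the ratio r/n plays the role of the probability p = eα(x, y) in the example.

```python
from math import comb, log2


def binary_entropy(p: float) -> float:
    """Entropy H(p) in bits of a two-outcome source with probabilities p, 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def hamming_ball_size(n: int, r: int) -> int:
    """Number of binary strings of length n within Hamming distance r of a fixed string."""
    return sum(comb(n, i) for i in range(r + 1))


def bound_holds(n: int) -> bool:
    """Check |ball(n, r)| <= 2^(n H(r/n)) for every radius r <= n/2."""
    return all(
        hamming_ball_size(n, r) <= 2 ** (n * binary_entropy(r / n))
        for r in range(n // 2 + 1)
    )


print(bound_holds(20), bound_holds(21))
print(hamming_ball_size(20, 10))   # about half of all 2^20 strings
```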

V. Normalized Information Distance

Clearly, unnormalized information distance (III.1) is not
a proper evolutionary distance measure. Consider three
species: E. coli, H. influenza, and some arbitrary bacteria
X of similar length as H. influenza, but not related.
Information distance d would have d(X, H. influenza) <
d(E. coli, H. influenza), simply because of the length factor.
It would put two long and complex sequences that
differ only by a tiny fraction of the total information as
dissimilar as two short sequences that differ by the same
absolute amount and are completely random with respect
to one another. In [31] we considered as a first attempt at a
normalized information distance:

Definition V.1: Given two sequences x and y, define the
function d_s(x, y) by

$$d_s(x, y) = \frac{K(x \mid y^*) + K(y \mid x^*)}{K(x, y)}. \tag{V.1}$$

Writing it differently, using (II.1),

$$d_s(x, y) = 1 - \frac{I(x : y)}{K(x, y)}, \tag{V.2}$$

where I(x : y) = K(y) − K(y | x*) is known as the mutual
algorithmic information. It is "mutual" since we saw from
(II.1) that it is symmetric: I(x : y) = I(y : x) up to a
fixed additive constant. This distance satisfies the triangle
inequality, up to a small error term, and universality
(below), but only within a factor 2. Mathematically more
precise and satisfying is the distance:

Definition V.2: Given two sequences x and y, define the
function d(x, y) by

$$d(x, y) = \frac{\max\{K(x \mid y^*), K(y \mid x^*)\}}{\max\{K(x), K(y)\}}. \tag{V.3}$$

Remark V.3: Several natural alternatives for the denominator
turn out to be wrong:

(a) Divide by the length. Then, firstly we do not know
which of the two lengths involved to divide by, possibly the
sum or maximum, but furthermore the triangle inequality
and the universality (domination) properties are not satisfied.

(b) In the d definition divide by K(x, y). Then one has
d(x, y) = 1/2 whenever x and y are random (have maximal
Kolmogorov complexity) relative to one another. This is
improper.

(c) In the d_s definition dividing by length does not satisfy
the triangle inequality.

There is a natural interpretation to d(x, y): If K(y) ≥
K(x) then we can rewrite

$$d(x, y) = \frac{K(y) - I(x : y)}{K(y)} = 1 - \frac{I(x : y)}{K(y)}.$$

That is, 1 − d(x, y) between x and y is the number of bits
of information that is shared between the two strings per
bit of information of the string with most information.
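
To make the definition of d(x, y) concrete, here is a minimal sketch of its compression-based analogue, the "normalized compression distance" mentioned in the Introduction. Several things here are assumptions not stated in this section. By the symmetry of information, max{K(x | y*), K(y | x*)} equals K(x, y) − min{K(x), K(y)} up to an additive constant. Replacing K by the compressed length C of a real compressor, and K(x, y) by the compressed length of the concatenation, gives the commonly used formula NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}. The use of zlib and the names `clen` and `ncd` are illustrative choices only.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes; a computable stand-in for K (assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based analogue of d(x, y):
    (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(0)
text = b"the quick brown fox jumps over the lazy dog. " * 40
variant = text.replace(b"lazy", b"sleepy")                      # closely related to text
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))     # unrelated to text

print("text vs variant:", round(ncd(text, variant), 3))
print("text vs noise:  ", round(ncd(text, noise), 3))
```

With a real compressor the values are only approximate: they can slightly exceed 1, and the distance of an object to itself is small but usually not exactly 0.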

Lemma V.4: d(x, y) satisfies the metric (in)equalities up
to additive precision O(1/K), where K is the maximum of
the Kolmogorov complexities of the objects involved in the
(in)equality.

Proof: Clearly, d(x, y) is precisely symmetrical. It
also satisfies the identity axiom up to the required precision:

$$d(x, x) = O(1/K(x)).$$

To show that it is a metric up to the required precision, it
remains to prove the triangle inequality.

Claim V.5: d(x, y) satisfies the triangle inequality
d(x, y) ≤ d(x, z) + d(z, y) up to an additive error term
of O(1/max{K(x), K(y), K(z)}).

Proof: Case 1: Suppose K(z) ≤ max{K(x), K(y)}.
In [23], the following "directed triangle inequality" was
proved: For all x, y, z, up to an additive constant term,

$$K(x \mid y^*) \le K(x, z \mid y^*) \le K(x \mid z^*) + K(z \mid y^*). \tag{V.4}$$

Dividing both sides by max{K(x), K(y)}, majorizing and
rearranging,

$$\begin{aligned}
\frac{\max\{K(x \mid y^*), K(y \mid x^*)\}}{\max\{K(x), K(y)\}}
&\le \frac{\max\{K(x \mid z^*) + K(z \mid y^*),\; K(y \mid z^*) + K(z \mid x^*)\}}{\max\{K(x), K(y)\}} \\
&\le \frac{\max\{K(x \mid z^*), K(z \mid x^*)\}}{\max\{K(x), K(y)\}} + \frac{\max\{K(z \mid y^*), K(y \mid z^*)\}}{\max\{K(x), K(y)\}},
\end{aligned}$$

up to an additive term O(1/max{K(x), K(y), K(z)}). Replacing
K(y) by K(z) in the denominator of the first term
in the right-hand side, and K(x) by K(z) in the denominator
of the second term of the right-hand side, respectively,
can only increase the right-hand side (again, because of the
assumption).

Case 2: Suppose K(z) = max{K(x), K(y), K(z)}. Further
assume that K(x) ≥ K(y) (the remaining case is symmetrical).
Then, using the symmetry of information to
determine the maxima, we also find K(z | x*) ≥ K(x | z*)
and K(z | y*) ≥ K(y | z*). Then the maxima in the terms
of the equation d(x, y) ≤ d(x, z) + d(y, z) are determined,
and our proof obligation reduces to:

$$\frac{K(x \mid y^*)}{K(x)} \le \frac{K(z \mid x^*)}{K(z)} + \frac{K(z \mid y^*)}{K(z)}, \tag{V.5}$$

up to an additive term O(1/K(z)). To prove (V.5) we
proceed as follows:






Applying the triangle inequality (V.4) and dividing both
sides by K(x), we have

$$\frac{K(x \mid y^*)}{K(x)} \le \frac{K(x \mid z^*) + K(z \mid y^*) + O(1)}{K(x)}, \tag{V.6}$$

where the left-hand side is ≤ 1.

Case 2.1: Assume that the right-hand side is ≤ 1.
Set K(z) = K(x) + Δ, and observe that K(x | z*) + Δ =
K(z | x*) + O(1) by (II.1). Add Δ to both the numerator
and the denominator in the right-hand side of (V.6), which
increases the right-hand side because it is a ratio ≤ 1, and
rewrite:

$$\frac{K(x \mid y^*)}{K(x)} \le \frac{K(x \mid z^*) + K(z \mid y^*) + \Delta + O(1)}{K(x) + \Delta} = \frac{K(z \mid x^*) + K(z \mid y^*) + O(1)}{K(z)},$$

which was what we had to prove.

Case 2.2: The right-hand side is > 1. We proceed like in
Case 2.1, and add Δ to both numerator and denominator.
Although now the right-hand side decreases, it must still
be > 1, and hence it still exceeds the left-hand side, which
is ≤ 1. This proves Case 2.2. ■

■ 
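
The ratio argument used in Cases 2.1 and 2.2 can be verified directly. Writing a for the numerator and b for the denominator (symbols introduced here only for this check), with a ≥ 0, b > 0 and Δ ≥ 0:

$$\frac{a + \Delta}{b + \Delta} - \frac{a}{b} = \frac{(a + \Delta)\,b - a\,(b + \Delta)}{b\,(b + \Delta)} = \frac{\Delta\,(b - a)}{b\,(b + \Delta)}.$$

The difference is nonnegative when a ≤ b, so a ratio that is at most 1 can only increase (Case 2.1). When a > b the ratio decreases, but a + Δ > b + Δ keeps it above 1 (Case 2.2).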

Clearly, d(x, y) takes values in the range [0, 1 +
O(1/max{K(x), K(y)})]. To show that it is a normalized
distance, it is left to prove the density condition of Definition
IV.1.

Lemma V.6: The function d(x, y) satisfies the density
condition (IV.1).

Proof: Case 1: Assume K(y) ≤ K(x). Then,
d(x, y) = K(x | y*)/K(x). If d(x, y) ≤ e, then K(x | y*) ≤
eK(x). Adding K(y) to both sides, rewriting according to
(II.1), and subtracting K(x) from both sides, we obtain

$$K(y \mid x^*) \le eK(x) + K(y) - K(x) \le eK(x). \tag{V.7}$$

There are at most $\sum_{i=0}^{eK(x)} 2^i < 2^{eK(x)+1}$ binary programs
of length ≤ eK(x). Therefore, for fixed x there are <
$2^{eK(x)+1}$ objects y satisfying (V.7).

Case 2: Assume K(x) < K(y). Then, d(x, y) = K(y |
x*)/K(y). If d(x, y) ≤ e, then (V.7) holds again. Together,
Cases 1 and 2 prove the lemma. ■
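
The counting step in Case 1 uses the fact that there are 2^0 + 2^1 + ... + 2^m = 2^{m+1} − 1 < 2^{m+1} binary strings of length at most m. A brute-force check for small m, with the hypothetical helper name `count_strings_up_to`, is sketched below.

```python
from itertools import product


def count_strings_up_to(m: int) -> int:
    """Count binary strings of length 0, 1, ..., m by explicit enumeration."""
    return sum(1 for n in range(m + 1) for _ in product("01", repeat=n))


for m in range(8):
    total = count_strings_up_to(m)
    # The geometric sum gives 2^(m+1) - 1, which is strictly below 2^(m+1).
    print(m, total, total == 2 ** (m + 1) - 1, total < 2 ** (m + 1))
```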

Since we have shown that d(x, y) takes values in [0, 1],
it satisfies the metric requirements up to the given additive
precision, and it satisfies the density requirement in
Definition IV.1, it follows:

Theorem V.7: The function d(x, y) is a normalized distance
that satisfies the metric (in)equalities up to O(1/K)
precision, where K is the maximum of the Kolmogorov
complexities involved in the (in)equality concerned.

Remark V.8: As far as the authors know, the idea of
normalized metric is not well-studied. An exception is [52],
which investigates normalized metrics to account for relative
distances rather than absolute ones, and it does so for
much the same reasons as in the present work. An example
there is the normalized Euclidean metric |x − y| / (|x| + |y|),
where x, y ∈ R^n (R denotes the real numbers) and |·|
is the Euclidean metric (the L2 norm). Another example
is a normalized symmetric-set-difference metric. But these
normalized metrics are not necessarily effective in that the
distance between two objects gives the length of an effective
description to go from either object to the other one.





VI. Universality 

We now show that d(x, y) is universal in that it incorporates
every upper semi-computable (Definition II.3) similarity:
if objects x, y are similar according to a particular
feature of the above type, then they are at least that
similar in the d(x, y) sense. We prove this by demonstrating
that d(x, y) is at least as small as any normalized distance
between x, y in the wide class of upper semi-computable
normalized distances. This class is so wide that it will capture
everything that can be remotely of interest.

Remark VI.1: The function d(x, y) itself, being a ratio
between two maxima of pairs of upper semi-computable
functions, may not itself be semi-computable. (It is easy
to see that this is likely, but a formal proof is difficult.)
In fact, d(x, y) has ostensibly only a weaker computability
property: Call a function f(x, y) computable in the limit
if there exists a rational-valued recursive function g(x, y, t)
such that lim_{t→∞} g(x, y, t) = f(x, y). Then d(x, y) is in
this class. It can be shown [22] that this is precisely the
class of functions that are Turing-reducible to the halting
set. While d(x, y) is possibly not upper semi-computable,
it captures all similarities represented by the upper
semi-computable normalized distances in the class concerned,
which should suffice as a theoretical basis for all practical
purposes. ◇

Theorem VI.2: The normalized information distance
d(x, y) minorizes every upper semi-computable normalized
distance f(x, y) by d(x, y) ≤ f(x, y) + O(1/K), where
K = min{K(x), K(y)}.

Proof: Let x, y be a pair of objects and let f be a
normalized distance that is upper semi-computable. Let
f(x, y) = e.

Case 1: Assume that K(x) ≤ K(y). Then, given
x we can recursively enumerate the pairs x, v such that
f(x, v) ≤ e. Note that the enumeration contains x, y. By
the normalization condition (IV.1), the number of pairs
enumerated is less than 2^{eK(x)+1}. Every such pair, in
particular x, y, can be described by its index of length
≤ eK(x) + 1 in this enumeration. Since the Kolmogorov
complexity is the length of the shortest effective description,
given x, the binary length of the index plus an O(1)
bit program to perform the recovery of y must be at least
as large as the Kolmogorov complexity, which yields
K(y | x) ≤ eK(x) + O(1). Since K(x) ≤ K(y), by
(II.1), K(x | y*) ≤ K(y | x*), and hence d(x, y) = K(y | x*)/K(y).
Note that K(y | x*) ≤ K(y | x) + O(1), because
x* supplies the information (x, K(x)), which includes the
information x. Substitution gives:



$$d(x,y) = \frac{K(y \mid x^*)}{K(y)} \le \frac{eK(x) + O(1)}{K(x)} \le f(x,y) + O(1/K(x)).$$



Case 2: Assume that K(x) > K(y). Then, given
y we can recursively enumerate the pairs u, y such that
f(u, y) ≤ e. Note that the enumeration contains x, y. By
the normalization condition (IV.1), the number of pairs
enumerated is less than 2^{eK(y)+1}. Every such pair, in
particular x, y, can be described by its index of length
≤ eK(y) + 1 in this enumeration. Similarly to Case 1,
this yields K(x | y) ≤ eK(y) + O(1). Also, by (II.1),
K(y | x*) ≤ K(x | y*), and hence d(x, y) = K(x | y*)/K(x).
Substitution gives:



$$d(x,y) = \frac{K(x \mid y^*)}{K(x)} \le \frac{eK(y) + O(1)}{K(y)} \le f(x,y) + O(1/K(y)).$$




Fig. 1 

The evolutionary tree built from complete mammalian
mtDNA sequences using frequency of k-mers.



VII. Application to Whole Mitochondrial 
Genome Phylogeny 

It is difficult to find a more appropriate type of object 
than DNA sequences to test our theory: such sequences 
are finite strings over a 4-letter alphabet that are natu- 
rally recoded as binary strings with 2 bits per letter. We 
will use whole mitochondrial DNA genomes of 20 mammals 
and the problem of Eutherian orders to experiment. The 
problem we consider is this: It has been debated in biology 
which two of the three main groups of placental mammals, 
Primates, Ferungulates, and Rodents, are more closely related.
One cause of debate is that the maximum likelihood
method of phylogeny reconstruction gives the (Ferungulates,
(Primates, Rodents)) grouping for half of the proteins in
the mitochondrial genome, and (Rodents, (Ferungulates,
Primates)) for the other half [5]. The authors aligned 12
concatenated mitochondrial proteins taken from the following
species: rat (Rattus norvegicus), house mouse (Mus mus- 
culus), grey seal (Halichoerus grypus), harbor seal (Phoca 
vitulina), cat (Felis catus), white rhino (Ceratotherium si- 
mum), horse (Equus caballus), finback whale (Balaenoptera 
physalus), blue whale (Balaenoptera musculus), cow (Bos 
taurus), gibbon (Hylobates lar), gorilla (Gorilla gorilla), 
human (Homo sapiens), chimpanzee (Pan troglodytes), 
pygmy chimpanzee (Pan paniscus), orangutan (Pongo pyg- 
maeus), Sumatran orangutan (Pongo pygmaeus abelii), us- 
ing opossum (Didelphis virginiana), wallaroo (Macropus 
robustus) and platypus (Ornithorhynchus anatinus) as the 
outgroup, and built the maximum likelihood tree. The 
currently accepted grouping is (Rodents, (Primates, Fer- 
ungulates)). 

A. Alternative Approaches: 

Before applying our theory, we first examine the alterna- 
tive approaches, in addition to that of [Jj]. The mitochon- 
drial genomes of the above 20 species were obtained from 



GenBank. Each is about 18k bases, and each base is one 
out of four types: adenine (A), which pairs with thymine 
(T), and cytosine (C), which pairs with guanine (G). 

k-mer Statistic: In the early years, researchers
experimented using G+C contents, or slightly more general
k-mers (or Shannon block entropy), to classify DNA
sequences. This approach uses the frequency statistics of
length-k substrings in a genome, and the phylogeny is
constructed accordingly. To re-examine this approach, we
performed simple experiments: Consider all length-k blocks in
each mtDNA, for k = 1, 2, ..., 10. There are l = (4^{11} − 1)/3
different such blocks (some may not occur). We computed
the frequency of (overlapping) occurrences of each
block in each mtDNA. This way we obtained a vector of
length l for each mtDNA, where the ith entry is the
frequency with which the ith block occurs overlapping in the
mtDNA concerned (1 ≤ i ≤ l). For two such vectors
(representing two mtDNAs) p, q, their distance is computed as
d(p, q) = sqrt((p − q)^T (p − q)). Using neighbor joining [45],
the phylogeny tree that resulted is given in Figure 1.
Using the hypercleaning method [8], we obtain equally absurd
results. Similar experiments were repeated for size-k blocks
alone (for k = 10, 9, 8, 7, 6), without much improvement.
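
The k-mer frequency experiment described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, "frequency" is assumed to mean the relative frequency of overlapping occurrences, and the default block lengths are kept small so that the vector stays short.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequency_vector(seq, ks=(1, 2, 3)):
    """Relative frequencies of all overlapping length-k blocks over ACGT,
    concatenated for every k in ks (blocks in lexicographic order)."""
    vec = []
    for k in ks:
        windows = max(len(seq) - k + 1, 0)
        counts = Counter(seq[i:i + k] for i in range(windows))
        for block in product("ACGT", repeat=k):
            # Blocks that never occur get frequency 0.
            vec.append(counts["".join(block)] / windows if windows else 0.0)
    return vec


def euclidean(p, q):
    """d(p, q) = sqrt((p - q)^T (p - q))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

A distance matrix built from `euclidean` over all genome pairs would then be handed to a tree-building method such as neighbor joining.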

Gene Order: In [7] the authors propose to use the or- 
der of genes to infer the evolutionary history. This ap- 
proach does not work for closely related species such as 
our example where all genes are in the same order in the 
mitochondrial genomes in all 20 species. 

Gene Content: The gene content method, proposed 
in ^5], |3§|, uses as distance the ratio between the num- 
ber of genes two species share and the total number of 
genes. While this approach does not work here due to the 
fact that all 20 mammalian mitochondrial genomes share 
exactly the same genes, notice the similarity of the gene 
content formula and our general formula. 
Rearrangement Distance: Reversal and rearrangement
distances in j2H], [25], EH compare genomes using
other partial genomic information such as the number of 
reversals or translocations. These operations also do not 
appear in our mammalian mitochondrial genomes, hence 
the method again is not proper for our application. 

Transformation Distance or Compression Distance:
The transformation distance proposed in and
compression distance proposed in [24] essentially
correspond to K(x|y), which is asymmetric, and so they are not
admissible distances. Using K(x|y) in the GenCompress
approximation version produces a wrong tree with one of
the marsupials mixed up with ferungulates (the tree is not
shown here).

B. Our Compression Approach 

We have shown that the normalized information distance
d (and up to a factor 2 this holds also for d_s) is universal
among the wide class of normalized distances, including all
computable ones. These universal distances (actually,
metrics) between x and y are expressed in terms of K(x), K(y),
and K(x | y). The generality of the normalized information
distance d comes at the price of noncomputability:
Kolmogorov complexity is not computable but just upper 
semi-computable, Section [H] and d itself is (likely to be) 
not even that. Nonetheless, using standard compressors, 
we can compute an approximation of d. 

Remark VII.1: To prevent confusion, we stress that, in
principle, we cannot determine how far a computable
approximation of K(x) exceeds its true value. What we can
say is that if we generate a sequence x of n bits by flipping
a fair coin, then with overwhelming probability K(x)
is about n, and a real compressor will also compress x to
a string of about length n (that is, it will not compress
at all, and the compressed file length is about the
Kolmogorov complexity and truly approximates it).
However, these strings essentially consist of random noise and
have no meaning. But if we take a meaningful string, for
example the first 10^23 bits of the binary representation of
π = 3.1415..., then the Kolmogorov complexity is very
small (because a program of, say, 10,000 bits can compute
the string), but no standard compressor will be able to
compress the string significantly below its length of 10^23 (it will
not be able to figure out the inherent regularity). And it is
precisely the rare meaningful strings, rare in comparison to 
the overwhelming majority of strings that consist of ran- 
dom noise, that we can be possibly interested in, and for 
which the Kolmogorov complexity depends on computable 
regularities. Certain of those regularities may be easy to 
determine, even by a simple compressor, but some regu- 
larities may take an infeasible amount of time to discover. 



It is clear how to compute the real-world compressor
version of the unconditional complexities involved. With
respect to the conditional complexities, by (II.1) we have
K(x | y) = K(x, y) − K(y) (up to an additive constant),
and it is easy to see that K(x, y) = K(xy) up to additive
logarithmic precision. (Here K(xy) is the length of the
shortest program to compute the concatenation of x and y
without telling which is which. To retrieve (x, y) we need
to encode the separator between the binary programs for
x and y.) So K(x | y) is roughly equal to K(xy) − K(y).

In applying the approach in practice, we have to make
do with an approximation based on a real-world reference
compressor C. The resulting applied approximation of the
"normalized information distance" d is called the
normalized compression distance (NCD):

$$\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}. \tag{VII.1}$$



Here, C(xy) denotes the compressed size of the concatenation
of x and y, C(x) denotes the compressed size of x, and
C(y) denotes the compressed size of y. The NCD is a
nonnegative number 0 ≤ r ≤ 1 + ε representing how different
the two files are. Smaller numbers represent more similar
files. The ε in the upper bound is due to imperfections in
our compression techniques, but for most standard
compression algorithms one is unlikely to see an ε above 0.1 (in
our experiments gzip and bzip2 achieved NCDs above 1,
but PPMZ always had NCD at most 1).
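
As an illustration of (VII.1), the following sketch computes the NCD with compressors from the Python standard library. It is not the CompLearn implementation: the function name `ncd` and the `COMPRESSORS` table are hypothetical, and `zlib` (the DEFLATE method underlying gzip) and `bz2` merely stand in for the reference compressor C.

```python
import bz2
import zlib

# Stand-ins for the reference compressor C; any bytes -> bytes function works.
COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress}


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized compression distance:
    (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx = len(compressor(x))
    cy = len(compressor(y))
    cxy = len(compressor(x + y))  # C(xy): compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Calling `ncd(a, b, COMPRESSORS["bz2"])` switches the compressor. Because real compressors add headers and have finite windows, values slightly above 1 can occur, in line with the ε discussed above.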

The theory as developed for the Kolmogorov-complexity 
based "normalized information distance" in this paper does 
not hold directly for the (possibly poorly) approximating 
NCD. In ^5], we developed the theory of NCD based on the 
notion of a "normal compressor," and show that the NCD 
is a (quasi-)universal similarity metric relative to a normal
reference compressor C. The NCD violates metricity
only insofar as it deviates from "normality," and it
violates universality only insofar as C(x) stays above K(x).
The theory developed in the present paper is the boundary
case C = K, where the "partially violated universality"
has become full "universality." The conditional
C(y|x) has been replaced by C(xy) − C(x), which can
be interpreted in stream-based compressors as the
compression length of y based on using the "dictionary"
extracted from x. Similar statements hold for block sorting
compressors like bzip2, and designer compressors like
GenCompress. Since the writing of this paper the method
has been released in the public domain as open-source
software at http://complearn.sourceforge.net/: The
CompLearn Toolkit is a suite of simple utilities that one can use
to apply compression techniques to the process of
discovering and learning patterns. The compression-based
approach used is powerful because it can mine patterns in
completely different domains. In fact, this method is
so general that it requires no background knowledge about
any particular subject area. There are no domain-specific
parameters to set, and only a handful of general settings.

Number of Different k-mers: We have shown that
using k-mer frequency statistics alone does not work well.
However, let us now combine the k-mer approach with
the incompressibility approach. Let the number of
distinct, possibly overlapping, k-length words in a sequence
x be N(x). With k large enough, at least log_a(n), where
a is the cardinality of the alphabet and n the length of
x, we use N(x) as a rough approximation to K(x). For



IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO Y, MONTH 2004 






example, for a sequence with the repetition of only one
letter, this N(x) will be 1. The length k is chosen such
that: (i) if the two genomes concerned had been
generated randomly, then it is unlikely that they have a
k-length word in common; and (ii) it is usual that two
homologous sequences share the same k-length words. A
good choice is k = O(log_4 n), where n is the length of
the genomes and 4 is because we have 4 bases. There
are 4^{log_4 n} = n subwords because the alphabet has size
4 for DNA. To describe a particular choice of N subwords
of length k = log_4 n in a string of length n we need
approximately $\log\binom{n}{N} = N\log(4^k/N) = 2kN - N\log N$
bits. For a family of mitochondrial DNA, we typically have
5,000 ≤ N, n ≤ 20,000. In this range, 2kN − N log N can
be approximated by cN for some constant c. So, overall the
number of different subwords of length k is proportional to
N for this choice of parameters.

According to our experiment, k should be slightly larger
than log_4 n. For example, a mitochondrial DNA is about
17K bases long, and log_4 17000 = 7.02, while the k we use below
is in the range 6, ..., 13, 7, ..., 13, or 8, ..., 13, depending on
the formula and on whether spaced seeds (see below) are
used.
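
The arithmetic behind the choice of k can be checked with a few lines. The helper names below are hypothetical, and the logarithm in 2kN − N log N is assumed to be base 2, which is what makes log 4^k equal to 2k.

```python
import math


def log_alphabet(n: int, alphabet_size: int = 4) -> float:
    """log_a(n): the word length at which the a**k possible words number about n."""
    return math.log(n, alphabet_size)


def description_bits(N: int, k: int) -> float:
    """Approximate bits to name N of the 4**k words: 2kN - N*log2(N)."""
    return 2 * k * N - N * math.log2(N)


def bits_per_word(N: int, k: int) -> float:
    """The factor c in the approximation cN, i.e. 2k - log2(N)."""
    return description_bits(N, k) / N
```

For a genome of about 17,000 bases `log_alphabet(17000)` is just above 7, and `bits_per_word` shows how slowly the factor c changes as N moves through the range 5,000 to 20,000 for a fixed k.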

We justify the complexity approximation using the number
of different k-mers by the pragmatic observation that,
because the genomes evolve by duplications, rearrangements
and mutations [44], and assuming that duplicated
subwords are to be regarded as duplicated information that
can be "compressed out," while distinct subwords are not
"compressed out," it can be informally and intuitively
argued that a description of the set of different subwords
describes x. With our choice of parameters it therefore is
appropriate to use N(x) as a plausible proportional
estimate for K(x) in case x is a genome. So the size of the
set is used to replace the K(x) of genome x. K(x, y) is
replaced by the size of the union of the two subword sets.
Define N(x|y) as N(xy) − N(y). Given two sequences x and
y, following the definition of d, (V.3), the distance between
x and y can be defined as



$$d'(x,y) = \frac{\max\{N(x \mid y), N(y \mid x)\}}{\max\{N(x), N(y)\}}. \tag{VII.2}$$

Similarly, following d_s, (V.1), we can also define another
distance using N(x),

$$d^*(x,y) = \frac{N(x \mid y) + N(y \mid x)}{N(xy)}. \tag{VII.3}$$



Using d' and d*, we computed the distance matrices for the
20 mammal mitochondrial DNAs. Then we used
hypercleaning [8] to construct the phylogenies for the 20
mammals. Using either of d' and d*, we were able to construct
the tree correctly when 8 ≤ k ≤ 13, as in Figure 3. A tree
constructed with d' for k = 7 is given in Figure 2. We note
that the opossum and a few other species are misplaced.
The tree constructed with d* for k = 7 is very similar, but
it correctly positioned the opossum.
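
The two distances (VII.2) and (VII.3) can be sketched with plain Python sets. This is an illustrative reading rather than the authors' code: the function names are hypothetical, and N(xy) is taken to be the size of the union of the two k-mer sets, following the statement that K(x, y) is replaced by that union.

```python
def kmer_set(seq: str, k: int) -> set:
    """Set of distinct, possibly overlapping, length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distances(x: str, y: str, k: int):
    """Return (d', d*) built from the number of distinct k-mers."""
    sx, sy = kmer_set(x, k), kmer_set(y, k)
    if not sx or not sy:
        raise ValueError("both sequences must be at least k letters long")
    n_x, n_y = len(sx), len(sy)
    n_xy = len(sx | sy)            # N(xy): size of the union of the two sets
    n_x_given_y = n_xy - n_y       # N(x|y) = N(xy) - N(y)
    n_y_given_x = n_xy - n_x       # N(y|x) = N(xy) - N(x)
    d_prime = max(n_x_given_y, n_y_given_x) / max(n_x, n_y)
    d_star = (n_x_given_y + n_y_given_x) / n_xy
    return d_prime, d_star
```

Identical sequences give (0.0, 0.0), and sequences sharing no k-mer give (1.0, 1.0).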

Number of Spaced k-mers: In methods for doing DNA
homology search, a pair of identical words, each from a
Fig. 2

The evolutionary tree built from complete mammalian
mtDNA sequences using block size k = 7 and d'.



DNA sequence, is called a "hit". Hits have been used as
"seeds" to generate a longer match between the two
sequences. If we define N(x|y) as the number of distinct
words that are in x and not in y, then the more hits the
two sequences have, the smaller N(x|y) and N(y|x) are.
Therefore, the (VII.2), (VII.3) distances can also be
interpreted as a function of the number of hits, each of which
indicates some mutual information of the two sequences.

As noticed by the authors of [HSj) though it is difficult 
to get the first hit (of k consecutive letters) in a region, 
it only requires one more base match to get a second hit 
overlapping the existing one. This makes it inaccurate to 
attribute the same amount of information to each of the 
hits. For this reason, we also tried to use the "spaced 
model" introduced in 3(i to compute our distances. A 
length-L, weight-k spaced template is a 0-1 string of length
L having k entries 1. We shift the template over the DNA
sequence, one position each step, starting with the first
positions aligned and finishing with the last positions aligned.
At each step we extract the ordered sequence of the k bases in
the DNA sequence covered by the 1-positions of the
template to form a length-k word. The number of different
such words is then used to define the distances d' and d*
in Formulas (VII.2) and (VII.3).
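
The template-shifting procedure just described can be sketched in a few lines. The function name and the example template are hypothetical; the text does not say which templates were used in the experiments.

```python
def spaced_words(seq: str, template: str) -> set:
    """Distinct words read through a 0-1 template slid along seq.

    The template has length L and weight k (number of '1' entries); at each
    offset the letters under the '1' positions form one length-k word.
    """
    if not template or set(template) - {"0", "1"}:
        raise ValueError("template must be a non-empty string of 0s and 1s")
    ones = [i for i, bit in enumerate(template) if bit == "1"]
    length = len(template)
    return {
        "".join(seq[start + i] for i in ones)
        for start in range(len(seq) - length + 1)
    }
```

With an all-ones template this reduces to the ordinary contiguous k-mers; the size of the returned set then plays the role of N(x) in the distance formulas.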

We applied the newly defined distances to the 20 mammal
data. The performance is slightly better than the
performance of the distances defined in (VII.2) and (VII.3). The
modified d' and d* can correctly construct the mammal
tree when 7 ≤ k ≤ 13 and 6 ≤ k ≤ 13, respectively.

Compression: To achieve the best approximation of 
Kolmogorov complexity, and hence most confidence in the 
approximation of d_s and d, we used a new version of
the GenCompress program, [T2\, which achieved the best 
compression ratios for benchmark DNA sequences at the 
time of writing. GenCompress finds approximate matches 
(hence edit distance becomes a special case), approximate 
Fig. 3

The evolutionary tree built from complete mammalian 
mtDNA sequences. 



reverse complements, among other things, with arithmetic 
encoding when necessary. An online service for GenCompress
can be found on the web. We computed d(x, y) between 
each pair of mtDNA x and y, using GenCompress to
heuristically approximate K(x|y), K(x), and K(x, y), and
constructed a tree (Figure 3) using the neighbor joining [45]
program in the MOLPHY package pQ. The tree is
identical to the maximum likelihood tree of Cao et al. [5].
For comparison, we used the hypercleaning program [8]
and obtained the same result. The phylogeny in Figure 3
re-confirms the hypothesis of (Rodents, (Primates,
Ferungulates)). Using the d_s measure gives the same result.

To further assure our results, we have extracted only 
the coding regions from the mtDNAs of the above species, 
and performed the same computation. This resulted in the 
same tree. 
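
The step from pairwise distances to tree construction goes through a symmetric distance matrix. The sketch below shows one way to build it; it is an assumption-laden stand-in, using `zlib` instead of GenCompress and a hypothetical `distance_matrix` helper, and it leaves the tree building itself (neighbor joining or hypercleaning) to a separate tool.

```python
import zlib
from itertools import combinations


def ncd(x: bytes, y: bytes) -> float:
    """NCD with zlib as the (assumed) reference compressor."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs: dict) -> dict:
    """Nested dict m[a][b] of pairwise distances between named sequences.

    Each unordered pair is computed once and mirrored, so the matrix is
    symmetric by construction even though C(xy) and C(yx) may differ
    slightly for a real compressor. The diagonal is set to 0.
    """
    names = sorted(seqs)
    m = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        m[a][b] = m[b][a] = ncd(seqs[a], seqs[b])
    return m
```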

Remark VII.2: In JS] we have repeated these phylogeny
experiments using bzip2 and PPMZ compressors, and a 
new quartet method to reconstruct the phylogeny tree. In 
all cases we obtained the correct tree. This is evidence that 
the compression NCD method is robust under change of 
compressors, as long as the window size of the used com- 
pressor is sufficient for the files concerned, that is, Gen- 
Compress can be replaced by other more general-purpose 
compressors. Simply use 

Evaluation: This new method for whole genome comparison
and phylogeny requires neither gene identification
nor any human intervention; it is totally automatic.
It is mathematically well-founded being based on general 
information theoretic concepts. It works when there are no 
agreed upon evolutionary models, as further demonstrated 
by the successful construction of a chain letter phylogeny 
[1] and when individual gene trees do not agree (Cao et 
al., [5]) as is the case for genomes. As a next step, using 
the approach in we have applied this method to much 
larger nuclear genomes of fungi and yeasts. This work is 
Fig. 4

The language tree using approximated normalized
information distance, d_s-version (V.1), and neighbor joining.



not reported yet. 

VIII. The Language Tree 

Normalized information distance is a totally general
universal tool, not restricted to a particular application area.
We show that it can also be used to successfully classify
natural languages. We downloaded the text corpora of
"The Universal Declaration of Human Rights" in 52
Euro-Asian languages from the United Nations website [23]. All
of them are in UNICODE. We first transform each
UNICODE character in the language text into an ASCII
character by removing its vowel flag if necessary. Secondly,
as compressor to compute the NCD we used the Lempel-Ziv
compressor gzip. This seems appropriate for compressing these
text corpora, whose sizes (2 kilobytes) do not exceed the length
of the sliding window gzip uses (32 kilobytes). In the last step,
we applied the d_s-metric (V.1) with the neighbor-joining
package to obtain Figure 4. Applying the d-metric (V.3)
with the Fitch-Margoliash method
[TH| in the package PHYLIP p^) worked even better; the
resulting language classification tree is given in Figure 5.
We note that all
the main linguistic groups can be successfully recognized,
Fig. 5

The language tree using approximated normalized
information distance, d-version (V.3), and the
Fitch-Margoliash method.



which includes Romance, Celtic, Germanic, Ugro-Finnic, 
Slavic, Baltic, Altaic as labeled in the figure. In both cases, 
it is a rooted tree using Basque [Spain] as outgroup. The 
branch lengths are not proportional to the actual distances 
in the distance matrix. 

Any language tree built by only analyzing contemporary
natural text corpora is partially corrupted by historical
inter-language contaminations. In fact, this is also
the case with genomic evolution: According to current
insights phylogenetic trees are not only based on inheritance,
but also the environment is at work through selection, and
this even introduces an indirect interaction between species,
called reticulation¹ (arguably less direct than the
borrowings between languages). Thus, while English is ostensibly
a Germanic Anglo-Saxon language, it has absorbed a great
deal of French-Latin components. Similarly, Hungarian,
often considered a Finn-Ugric language (a consensus that
currently happens to be open to debate in the linguistic
community), is known to have absorbed many Turkish and

1 Joining of separate lineages on a phylogenetic tree, generally 
through hybridization or through lateral gene transfer. Fairly com- 
mon in certain land plant clades; reticulation is thought to be rare 
among metazoans. 6 



Slavic components. Thus, an automatic construction of
a language tree based on contemporary text corpora
exhibits current linguistic relations which do not necessarily
coincide completely with the historic language family tree.
The misclassification of English as a Romance language is
reinforced by the fact that the English vocabulary in the
Universal Declaration of Human Rights, being nonbasic in
large part, is Latinate in large part. This presumably also
accounts for the misclassification of Maltese, an Arabic
dialect with lots of Italian loan words, as Romance. With
these caveats in mind, the result of our automatic
experiment in language tree reconstruction is accurate.

Our method improves on the results of [2], which used the same
linguistic corpus with an asymmetric measure based on
the approach sketched in the section "Related Work." In
the resulting language tree of [2], English is isolated between
Romance and Celtic languages, Romani-Balkan and Albanian
are isolated, and Hungarian is grouped with Turkish and
Uzbek. The (rooted) trees resulting from our experiments
(using Basque as outgroup) seem more correct. We use
Basque as outgroup since linguists regard it as a language
unconnected to other languages.

IX. Conclusion 

We developed a mathematical theory of compression-based
similarity distances and showed that there is a universal
similarity metric: the normalized information distance.
This distance uncovers all upper semi-computable
similarities, and therefore estimates an evolutionary or
relation-wise distance on strings. A practical version was exhibited
based on standard compressors. Here it has been shown
to be applicable to whole genomes, and to build a large
language family tree from text corpora. References to
applications in a plethora of other fields can be found in
the Introduction. It is perhaps useful to point out that
the results reported in the figures were obtained at the
very first runs and have not been selected by
appropriateness from several trials. From the theory point of view
we have obtained a general mathematical theory forming
a solid framework spawning practical tools applicable in
many fields. Based on the noncomputable notion of
Kolmogorov complexity, the normalized information distance
can only be approximated without convergence guarantees.
Even so, the fundamental rightness of the approach is
evidenced by the remarkable success (agreement with known
phylogeny in biology) of the evolutionary trees obtained
and the building of language trees. From the applied side
of genomics our work gives the first fully automatic
generation of whole genome mitochondrial phylogeny; in
computational linguistics it presents a fully automatic way to
build language trees and determine language families.

Appendix 

I. A Variant Method in Linguistics 

In [2] the purpose is to infer a language tree from 
different-language text corpora, as well as do authorship 
attribution on the basis of text corpora. The distances
determined between objects are justified by ad-hoc plausibility
arguments (although the information distance of [&-t| .
0] is also mentioned). The paper |2] is predated by our 
universal similarity metric work and phylogeny tree (hier- 
archical clustering) experiments |TT], ^2]j EH]j but it is 
the language tree experiment we repeated in the present 
paper using our own technique with somewhat better re- 
sults. For comparison of the methods we give some brief 
details. Assume a fixed compressor (0, [H] use the
Lempel-Ziv type). Let C(x) denote the length of the
compressed version of a file x, and let x' be a short file
from the same source as x. For example, if x is a long
text in a language, then x' is a short text in the same
language. (The authors refer to sequences generated by
the same ergodic source.) Then two distances are
considered between files x, y: (i) the asymmetric distance
s(x, y) = ([C(xy') − C(x)] − [C(yy') − C(y)])/|y'|, the
numerator quantifying the difference in compressing y' using
a data base sequence generated by a different source versus
one generated by the same source that generated y'; and
(ii) a symmetric distance S(x, y) = s(x, y)|y'|/[C(yy') −
C(y)] + s(y, x)|x'|/[C(xx') − C(x)]. The distances are not
metric (neither satisfies the triangle inequality) and the
authors propose to "triangularize" in practice by a
Procrustes method: setting S(x, y) := min_w (S(x, w) + S(w, y))
in case the left-hand side exceeds the right-hand side. We
remark that in that case the left-hand side S(x, y) becomes
smaller and may in its turn cause a violation of another
triangle inequality as a member of the right-hand side, and
so on. On the upside, despite the lack of supporting theory,
the authors report successful experiments.

II. A Variant Method in Data Mining 

In the follow-up data mining paper [27] the authors
report successful experiments using a simplified version of the
NCD (VII.1) called compression-based dissimilarity
measure (CDM):

$$\mathrm{CDM}(x,y) = \frac{C(xy)}{C(x) + C(y)}.$$



Note that this measure always ranges between 1/2 (for x = y)
and 1 (for x and y satisfying C(xy) = C(x) + C(y), that is,
compressing x doesn't help in compressing y). The
authors don't give a theoretical analysis, but intuitively this
formula measures similarity of x and y by comparing the
lengths of the compressed files in combination and
separately.
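
For comparison with the NCD, the CDM formula can be sketched the same way. The function name is hypothetical and `zlib` is an assumed stand-in for the compressor used in that paper.

```python
import zlib


def cdm(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Compression-based dissimilarity measure: C(xy) / (C(x) + C(y))."""
    return len(compressor(x + y)) / (len(compressor(x)) + len(compressor(y)))
```

With a real compressor the value for x = y is only close to 1/2, since compressing the second copy still costs a few bytes, and unrelated inputs give values close to 1.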